Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

outlier: tooling for success rate ejection #618

Merged
merged 17 commits into from
Mar 31, 2017
Merged

Conversation

junr03
Copy link
Member

@junr03 junr03 commented Mar 24, 2017

No description provided.

* -1 means that the host did not have enough request volume to calculate success rate
* or the cluster did not have enough hosts to run through success rate outlier ejection.
*/
virtual double successRate() const PURE;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both of these should come directly out of the outlier detector sink via here: https://github.com/lyft/envoy/blob/ce8eb8634f0c6aa24485e17fbd563ecc18d73982/include/envoy/upstream/host_description.h#L51

Don't add to host object.

@@ -35,6 +35,7 @@ modify different aspects of the server.
weight, Integer, Load balancing weight (1-100)
zone, String, Service zone
canary, Boolean, Whether the host is a canary
success_rate, Double, Request success rate (0-100). -1 if there was not enough request volume to calculate it
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you link to the info about "not enough request volume"?

@@ -65,8 +78,9 @@ class EventLogger {
* Log an ejection event.
* @param host supplies the host that generated the event.
* @param type supplies the type of the event.
* @param enforced is true if the ejection took place, false if only logging took place.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

run on sentence.

@@ -266,6 +271,9 @@ double Utility::successRateEjectionThreshold(
variance /= valid_success_rate_hosts.size();
double stdev = std::sqrt(variance);

// Set the Detector's member variable
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: full stop.

@@ -48,6 +48,19 @@ class DetectorHostSink {
* @return the last time this host was unejected, if the host has been unejected previously.
*/
virtual const Optional<SystemTime>& lastUnejectionTime() PURE;

/**
* @return the success rate of the host in the last calculated interval, in the range -1-100.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

-1-100 is a bit misleading, as it's [0;100] with -1 as a separate value. I like the way you cover this in docs, maybe do the same here.

@@ -35,6 +35,9 @@ modify different aspects of the server.
weight, Integer, Load balancing weight (1-100)
zone, String, Service zone
canary, Boolean, Whether the host is a canary
success_rate, Double, "Request success rate (0-100). -1 if there was not enough
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Link to some docs here (or write it here) that describe what interval this is over.

* Set the success rate of the host.
* @param new_success_rate the new success rate calculated for the host in the last interval.
*/
virtual void successRate(double new_success_rate) PURE;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't need to be in public interface.

@@ -241,7 +245,8 @@ void DetectorImpl::onConsecutive5xxWorker(HostSharedPtr host) {
const double Utility::SUCCESS_RATE_STDEV_FACTOR = 1.9;

double Utility::successRateEjectionThreshold(
double success_rate_sum, const std::vector<HostSuccessRatePair>& valid_success_rate_hosts) {
double success_rate_sum, const std::vector<HostSuccessRatePair>& valid_success_rate_hosts,
double& success_rate_average) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of passing an out param, can you return a struct. It's easier to grok what is going on.

@@ -330,7 +348,8 @@ void DetectorImpl::runCallbacks(HostSharedPtr host) {
}
}

void EventLoggerImpl::logEject(HostDescriptionConstSharedPtr host, EjectionType type) {
void EventLoggerImpl::logEject(HostDescriptionConstSharedPtr host, EjectionType type,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't it be useful to log some of the SR stats here also to see the SR, the cutoff, etc.?

@@ -300,6 +301,8 @@ class Cluster : public virtual HostSet {
*/
virtual ClusterInfoConstSharedPtr info() const PURE;

virtual Outlier::DetectorSharedPtr outlierDetector() const PURE;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comment, this should return a pointer to a const detector.

@@ -246,7 +255,8 @@ class Utility {
*/
static double
successRateEjectionThreshold(double success_rate_sum,
const std::vector<HostSuccessRatePair>& valid_success_rate_hosts);
const std::vector<HostSuccessRatePair>& valid_success_rate_hosts,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

update comment

Upstream::Outlier::DetectorSharedPtr outlier_detector,
Buffer::Instance& response) {
if (outlier_detector) {
response.add(fmt::format("{}::outlier::success_rate_average::{}", cluster_name,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

put all these into docs?

If ``action`` is ``eject``, species the type of ejection that took place. Currently this can
only be ``5xx``.
If ``action`` is ``eject``, specifies the type of ejection that took place. Currently this can be:
``5xx``, and ``SuccessRate``.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No comma after 5xx

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought you were a supporter of the oxford comma ;)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, when there are 3 or more things, not 2.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Comma still needs to be deleted)


host_success_rate
If ``action`` is ``eject``, and ``type`` is ``SuccessRate``, specifies the host's success rate
at the time of the ejection event on a ``0-100`` range. A ``-1`` value means that a success
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

-1 can only happen if type is 5xx, right? This is kind of confusing. Can we just not write out host_succes_rate in the case of 5xx? Same for all the ones below.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

-1 can also happen if the type is success rate, i.e if there is no enough request volume. However, you are right it would never happen in the logs. Because, if a host is ejected it is because there was enough data to do that. -1 would happen in the admin /clusters endpoint, not here.

zone, String, Service zone
canary, Boolean, Whether the host is a canary
success_rate, Double, "Request success rate (0-100). -1 if there was not enough
:ref:`request volume<config_cluster_manager_cluster_outlier_detection_success_rate_request_volume>`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you actually look at these docs rendered? This is almost definitely not correct and will look broken.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I rendered them and they look correct

* @return a shared pointer to the cluster's outlier detector. If an outlier detector has not been
* installed, returns a nullptr.
*/
virtual const Outlier::DetectorSharedPtr outlierDetector() const PURE;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return const shared pointer has no meaning. You need to return a pointer to a const Detector.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I see this is a const pointer not a pointer to a const object. What we want is to return a pointer to a const object that way we can't change the object through the pointer.

host->cluster().name(), host->address()->asString(),
typeToString(type), host->outlierDetector().numEjections(), enforced,
host->outlierDetector().successRate(), detector.successRateAverage(),
detector.successRateEjectionThreshold()));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add break here for next person that comes along.

@@ -237,14 +247,21 @@ class EventLoggerImpl : public EventLogger {
*/
class Utility {
public:
struct EjectionPair {
EjectionPair(double success_rate_average, double ejection_threshold)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't need constructor here. You can just do aggregate init ( {...} )

/**
* This function returns the success rate threshold for success rate outlier detection. If a
* host's success rate is under this threshold the host is an outlier.
* @param success_rate_sum is the sum of the data in the success_rate_data vector.
* @param success_rate_data is the vector containing the individual success rate data points.
* @param valid_success_rate_hosts is the vector containing the individual success rate data
* points.
* @return the success rate threshold.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comment wrong

@mattklein123
Copy link
Member

Looks good other than comma still in docs. Do you have 100% coverage?

"host_success_rate": "...",
"cluster_success_rate_average": "...",
"cluster_success_rate_ejection_threshold": "..."

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: new line

If ``action`` is ``eject``, species the type of ejection that took place. Currently this can
only be ``5xx``.
If ``action`` is ``eject``, specifies the type of ejection that took place. Currently this can be:
``5xx`` and ``SuccessRate``.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The second sentence doesn't make sense. Can you rewrite it? Also colon after be isn't needed.

}

time
The time that the event took place.

secs_since_last_action
The time in seconds since the last action (either an ejection or unejection)
took place. This time will be -1 for the first ejection given there is no
took place. This time will be ``-1`` for the first ejection given there is no
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

change this time will be to the value will be

reason and then re-added).

enforced
If ``action`` is ``eject``, specifies if the ejection was actually performed (``true``), or
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this sentence is too long and confusing. Please break it up.

@@ -95,19 +96,22 @@ class DetectorHostSinkImpl : public DetectorHostSink {
: detector_(detector), host_(host) {
// Point the success_rate_accumulator_bucket_ pointer to a bucket.
updateCurrentSuccessRateBucket();
success_rate_ = -1;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move to initializer list

/**
* This function returns the success rate threshold for success rate outlier detection. If a
* host's success rate is under this threshold the host is an outlier.
* This function returns the an EjectionPair for success rate outlier detection. The pair contains
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

extra the after returns

* host's success rate is under this threshold the host is an outlier.
* This function returns the an EjectionPair for success rate outlier detection. The pair contains
* the average success rate of all valid hosts in the cluster, and the ejection threshold.
* If a host's success rate is under this threshold the host is an outlier
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit full stop at the end.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comma after the first sentence

* @param success_rate_data is the vector containing the individual success rate data points.
* @return the success rate threshold.
* @param valid_success_rate_hosts is the vector containing the individual success rate data
* points.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit formating.


// Need to update the writer bucket to keep the data valid.
host.second->updateCurrentSuccessRateBucket();
// Refresh host success rate stat for the /clusters endpoint. If there is a new valid value it
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comma after the first sentence in the if sentence.

* This function returns the success rate threshold for success rate outlier detection. If a
* host's success rate is under this threshold the host is an outlier.
* This function returns the an EjectionPair for success rate outlier detection. The pair contains
* the average success rate of all valid hosts in the cluster, and the ejection threshold.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no comma needed before the and

@@ -277,6 +281,10 @@ void DetectorImpl::processSuccessRateEjections() {
std::vector<HostSuccessRatePair> valid_success_rate_hosts;
double success_rate_sum = 0;

// Reset the Detector's success rate mean and stdev
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit full stop

@@ -212,6 +219,8 @@ class DetectorImpl : public Detector, public std::enable_shared_from_this<Detect
std::list<ChangeStateCb> callbacks_;
std::unordered_map<HostSharedPtr, DetectorHostSinkImpl*> host_sinks_;
EventLoggerSharedPtr event_logger_;
double success_rate_average_;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need default values for these?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

including success_rate_ejection_threshold_

virtual void logUneject(HostDescriptionConstSharedPtr host) PURE;
};

typedef std::shared_ptr<EventLogger> EventLoggerSharedPtr;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: new line after type def

Copy link
Contributor

@ccaraman ccaraman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good after few new line nits

@junr03 junr03 merged commit 0f5d9f5 into master Mar 31, 2017
@junr03 junr03 deleted the outlier-sr-tooling branch March 31, 2017 18:25
vijayendrabvs pushed a commit to vijayendrabvs/envoy that referenced this pull request Apr 4, 2017
rshriram pushed a commit to rshriram/envoy that referenced this pull request Oct 30, 2018
* Integrate with new mixer control library.

* Update to use splitted control lib.

* remove quota config files.

* Update per comment
jplevyak pushed a commit to jplevyak/envoy that referenced this pull request Aug 17, 2020
jpsim pushed a commit that referenced this pull request Nov 28, 2022
Fixes envoyproxy/envoy-mobile#617, which resulted in duplicate symbols from protobuf being present in the final Envoy Mobile binary.

The [root cause](bazelbuild/bazel#10484) is being [patched in Bazel](bazelbuild/bazel#10493). In the meantime, we can revert these two commits in protobuf ([first](protocolbuffers/protobuf@7b28278), [second](protocolbuffers/protobuf@a03d332)) and apply the diff as a patch until we're able to update Bazel with the fix. More context on the protobuf issue is available [here](protocolbuffers/protobuf#7046).

Notes:
- This should be reverted after updating Bazel with the fix
- The version of protobuf specified in the `http_archive` should be in lockstep with upstream Envoy's version
- Upstream Envoy's `@envoy//bazel:protobuf.patch` patch is also included here

Signed-off-by: Michael Rebello <[email protected]>
Signed-off-by: JP Simard <[email protected]>
jpsim pushed a commit that referenced this pull request Nov 29, 2022
Fixes envoyproxy/envoy-mobile#617, which resulted in duplicate symbols from protobuf being present in the final Envoy Mobile binary.

The [root cause](bazelbuild/bazel#10484) is being [patched in Bazel](bazelbuild/bazel#10493). In the meantime, we can revert these two commits in protobuf ([first](protocolbuffers/protobuf@7b28278), [second](protocolbuffers/protobuf@a03d332)) and apply the diff as a patch until we're able to update Bazel with the fix. More context on the protobuf issue is available [here](protocolbuffers/protobuf#7046).

Notes:
- This should be reverted after updating Bazel with the fix
- The version of protobuf specified in the `http_archive` should be in lockstep with upstream Envoy's version
- Upstream Envoy's `@envoy//bazel:protobuf.patch` patch is also included here

Signed-off-by: Michael Rebello <[email protected]>
Signed-off-by: JP Simard <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants